Load the diamonds data set, as well as ggplot2

setwd("/Users/Nick/Documents/udacity/projects/data_analysis_with_r/")
library(ggplot2)
library(gridExtra)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## 
## The following object is masked from 'package:scales':
## 
##     percent
## 
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## 
## The following objects are masked from 'package:base':
## 
##     as.array, trimws
library(RColorBrewer)
library(RCurl)
## Loading required package: bitops
library(bitops)
data(diamonds)

Create a scatterplot of price vs x for the diamonds data set

ggplot(data=diamonds, aes(x=x, y=price)) + geom_point()

Testing correlations between price and x/y/z

with(diamonds, cor.test(x, price))
## 
##  Pearson's product-moment correlation
## 
## data:  x and price
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8825835 0.8862594
## sample estimates:
##       cor 
## 0.8844352
with(diamonds, cor.test(y, price))
## 
##  Pearson's product-moment correlation
## 
## data:  y and price
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8632867 0.8675241
## sample estimates:
##       cor 
## 0.8654209
with(diamonds, cor.test(z, price))
## 
##  Pearson's product-moment correlation
## 
## data:  z and price
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8590541 0.8634131
## sample estimates:
##       cor 
## 0.8612494
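
The three tests above can also be condensed into one pass when only the coefficients are of interest. This is a minimal sketch, not part of the original analysis; the names `dims` and `price_cor` are made up for illustration.

```r
library(ggplot2)
data(diamonds)

# Correlate each dimension column with price in a single pass.
dims <- c("x", "y", "z")
price_cor <- sapply(dims, function(v) cor(diamonds[[v]], diamonds$price))

# Named vector of Pearson coefficients, one per dimension.
round(price_cor, 3)
```

`cor()` returns only the coefficient; `cor.test()` is still the call to use when the confidence interval or p-value is needed.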

Create a simple scatter plot of price vs depth. Add to the code some transparency in the dots (1/100) and mark the x-axis every 2 units.

ggplot(data=diamonds, aes(x=depth, y=price)) + geom_point(alpha=0.01) +
  scale_x_continuous(breaks=seq(40, 80, 2))

The above plot places the vast majority of diamonds between depths of 60 and 64.

What’s the correlation of depth vs. price?

with(diamonds, cor.test(depth, price))
## 
##  Pearson's product-moment correlation
## 
## data:  depth and price
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.019084756 -0.002208537
## sample estimates:
##        cor 
## -0.0106474

The correlation between the two is about -0.01, so depth has almost no linear relationship with price and would be a very poor predictor of it.
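
One way to put that number in perspective is to square it, which gives the share of price variance that a straight-line fit on depth alone would explain. The arithmetic below is a small illustration using the coefficient reported above; `r` and `r_squared` are names chosen here.

```r
# Pearson correlation of depth and price, copied from the test output above.
r <- -0.0106474

# Squared correlation: share of variance explained by a simple linear fit.
r_squared <- r^2
r_squared
```

The result is on the order of 0.0001, that is, roughly 0.01% of the variance in price.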

Create a scatterplot of price vs carat, and omit the top 1% of price and carat values.

ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() + 
  xlim(0, quantile(diamonds$carat, 0.99)) +
  ylim(0, quantile(diamonds$price, 0.99))
## Warning: Removed 926 rows containing missing values (geom_point).
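
The warning appears because `xlim()` and `ylim()` discard rows that fall outside the limits before plotting. As a hedged sketch (not what the original code does), the rows beyond the caps can be counted directly, and `coord_cartesian()` can be used to zoom without dropping data. The names `carat_cap`, `price_cap`, `n_outside`, and `p_zoom` are illustrative.

```r
library(ggplot2)
data(diamonds)

# 99th-percentile caps, stripped of the "99%" name that quantile() attaches.
carat_cap <- unname(quantile(diamonds$carat, 0.99))
price_cap <- unname(quantile(diamonds$price, 0.99))

# Rows beyond either cap; xlim()/ylim() would remove these from the plot data.
n_outside <- sum(diamonds$carat > carat_cap | diamonds$price > price_cap)
n_outside

# coord_cartesian() only zooms the view, so every row stays in the data.
p_zoom <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  coord_cartesian(xlim = c(0, carat_cap), ylim = c(0, price_cap))
```

The difference matters mostly once a smoother or other statistic is added, because a statistic computed after `xlim()` sees only the rows that survived the limits.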

Create a scatterplot of price vs volume (x * y * z). In the process, create a new variable for volume in the diamonds data frame.

diamonds$volume = diamonds$x * diamonds$y * diamonds$z
ggplot(data=diamonds, aes(x=volume, y=price)) + geom_point()

A handful of extreme outliers (there appear to be about three) stretch the x-axis, so it needs to be rescaled to exclude them. In general, though, price trends upward with diamond volume.

What’s the correlation of price and volume? Exclude diamonds that have a volume of 0 or that are greater than or equal to 800.

with(subset(diamonds, volume > 0 & volume < 800), cor.test(volume, price))
## 
##  Pearson's product-moment correlation
## 
## data:  volume and price
## t = 559.19, df = 53915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9222944 0.9247772
## sample estimates:
##       cor 
## 0.9235455

Correlation: 0.9235455

Subset the data to exclude diamonds with a volume greater than or equal to 800. Also, exclude diamonds with a volume of 0. Adjust the transparency of the points and add a linear model to the plot.

Do you think this would be a useful model to estimate the price of diamonds? Why or why not?

diam_sub = subset(diamonds, (diamonds$volume > 0) & (diamonds$volume < 800))
ggplot(data=diam_sub, aes(x=volume, y=price)) + geom_point(alpha=0.01) +
  geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

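Note that geom_smooth() defaulted to a GAM smoother here (see the message above), not the linear model the task asks for; passing method = "lm" would fit a straight line. The fitted curve is definitely not a useful price model beyond a volume of about 400, and perhaps not even before then.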
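
The task asks for a linear model, while the `geom_smooth()` call above fell back to a GAM. A minimal sketch of the linear version follows, assuming the same volume filter; `p_lm` and `fit` are names introduced here, and the separate `lm()` call is only there to expose the fitted coefficients.

```r
library(ggplot2)
data(diamonds)

# Rebuild the volume column and the filtered subset used above.
diamonds$volume <- with(diamonds, x * y * z)
diam_sub <- subset(diamonds, volume > 0 & volume < 800)

# Same scatterplot, but with a straight-line fit instead of the default smoother.
p_lm <- ggplot(diam_sub, aes(x = volume, y = price)) +
  geom_point(alpha = 0.01) +
  geom_smooth(method = "lm")

# The equivalent model fitted directly, to inspect intercept and slope.
fit <- lm(price ~ volume, data = diam_sub)
coef(fit)
```

Printing `p_lm` draws the plot; comparing the line against the point cloud shows where a straight line stops tracking the data.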

Use the dplyr package to create a new data frame containing info on diamonds by clarity. Name the data frame “diamondsByClarity”. The data should contain the following variables in this order: mean_price, median_price, min_price, max_price, and n, where “n” is the number of diamonds in each level of clarity.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:nlme':
## 
##     collapse
## 
## The following objects are masked from 'package:memisc':
## 
##     collect, query, rename
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following object is masked from 'package:GGally':
## 
##     nasa
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
diamondsByClarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            n = n())
diamondsByClarity
## Source: local data frame [8 x 6]
## 
##   clarity mean_price median_price min_price max_price     n
##    (fctr)      (dbl)        (dbl)     (int)     (int) (int)
## 1      I1   3924.169         3344       345     18531   741
## 2     SI2   5063.029         4072       326     18804  9194
## 3     SI1   3996.001         2822       326     18818 13065
## 4     VS2   3924.989         2054       334     18823 12258
## 5     VS1   3839.455         2005       327     18795  8171
## 6    VVS2   3283.737         1311       336     18768  5066
## 7    VVS1   2523.115         1093       336     18777  3655
## 8      IF   2864.839         1080       369     18806  1790

diamonds_mp_by_clarity and diamonds_mp_by_color are summary data frames with the mean price by clarity and color. Create two bar plots on one output image using the grid.arrange() function from the package gridExtra

library(gridExtra)
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))

diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))

plot1 = ggplot(data=diamonds_mp_by_clarity, aes(factor(clarity), y=mean_price)) + geom_bar(stat="identity")
plot2 = ggplot(data=diamonds_mp_by_color, aes(x=factor(color), y=mean_price)) + geom_bar(stat="identity")
grid.arrange(plot1, plot2, ncol=2)

What do you notice in each of the bar charts for mean price by clarity and mean price by color? For reference, clarity runs I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best), and color runs J (worst) to D (best).

Generally, the better the color, the lower the mean price, which is quite surprising. Similarly, aside from a couple of spikes, higher-clarity diamonds have lower mean prices.

This is strange. Let’s consider the mean price with respect to cut.

diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mp_by_cut <- summarise(diamonds_by_cut, mean_price = mean(price))
ggplot(data=diamonds_mp_by_cut, aes(factor(cut), y=mean_price)) + geom_bar(stat="identity")
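
One possible explanation for the counterintuitive bar charts, offered here as an assumption rather than a finding of the original analysis, is that diamond size differs across the quality grades. A quick way to check is to summarise mean carat next to mean price for each color; `carat_by_color` is a name made up for this sketch.

```r
library(ggplot2)
library(dplyr)
data(diamonds)

# Mean carat and mean price per color grade, side by side.
carat_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(mean_carat = mean(carat),
            mean_price = mean(price))
carat_by_color
```

If mean carat rises along with mean price across the grades, size is a likely confounder, which is why the later plots look at price against carat with color, cut, and clarity layered on top.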

Create a histogram of diamond prices. Facet the histogram by diamond color and use cut to color the histogram bars

ggplot(data=diamonds, aes(x=price, fill=cut)) + geom_histogram() +
  facet_wrap(~color)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
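
The message above is printed once per facet because no bin width was given. A minimal sketch of the same plot with an explicit width follows; the value 500 (dollars) is an arbitrary assumption chosen for illustration, and `p_hist` is a name introduced here.

```r
library(ggplot2)
data(diamonds)

# Same faceted histogram, with the bin width set explicitly (500 is an assumed value).
p_hist <- ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~ color)
```

Setting the width by hand also makes the bars comparable across facets and across reruns, since the default depends on the range of the data.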

Create a scatterplot of diamond price vs. table and color the points by the cut of the diamond.

ggplot(data=diamonds, aes(x=table, y=price, color=cut)) +
  geom_jitter() + 
  scale_x_continuous(breaks=seq(50, 80, 2), limits=c(50, 80))
## Warning: Removed 6 rows containing missing values (geom_point).

Create a scatterplot of diamond price vs. volume (x * y * z) and color the points by the clarity of diamonds. Use scale on the y-axis to take the log10 of price. You should also omit the top 1% of diamond volumes from the plot.

diamonds$volume = diamonds$x * diamonds$y * diamonds$z
# Drop zero-volume rows and the top 1% of volumes, as the task asks
ggplot(data=subset(diamonds, volume > 0 & volume < quantile(volume, probs=0.99)),
       aes(x=volume, y=price, color=clarity)) + 
  geom_jitter() + 
  scale_y_log10(breaks=c(300, 550, 1000, 1700, 3000, 5500, 10000))

Create a scatter plot of the price/carat ratio of diamonds. The variable x should be assigned to cut. The points should be colored by diamond color, and the plot should be faceted by clarity.

ggplot(data=diamonds, aes(x=cut, y=price/carat)) +
  geom_jitter(aes(color=color), alpha=0.2) + 
  facet_wrap(~ clarity)

Diamond data of price vs carat with a linear model

ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point(shape=21) +
  xlim(c(0, quantile(diamonds$carat, probs=0.9))) +
  ylim(c(0, quantile(diamonds$price, probs=0.9))) +
  stat_smooth(method="lm")
## Warning: Removed 6369 rows containing missing values (stat_smooth).
## Warning: Removed 6369 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing missing values (geom_path).

All things vs all things!

set.seed(20022012)
diamond_samp = diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, params = c(shape = I("."), outlier.shape = I(".")))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Histograms of price, with one being in log10

plt1 = ggplot(data = diamonds, aes(x = price)) + geom_histogram() +
  ggtitle("Price")
plt2 = ggplot(data = diamonds, aes(x = price)) + geom_histogram(binwidth=0.01) +
  scale_x_log10(breaks = c(100, 300, 1000, 3000, 10000)) +
  ggtitle("Price (log10)")
grid.arrange(plt1, plt2, ncol=2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Scatter plot transformations

ggplot(data = diamonds, aes(x = carat, y = price)) + 
  geom_point() +
  scale_y_continuous(trans = log10_trans()) +
  ggtitle("Price (log10) by Carat")

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new("cuberoot",
                                      transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)
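
A transformation object needs a forward function and an inverse that undo each other, otherwise axis breaks and labels end up in the wrong place. The base-R sketch below checks that round trip for the cube root on the carat breaks used in the plots; `cuberoot`, `cube`, and `carats` are names chosen for this illustration.

```r
# Forward and inverse functions, matching the ones passed to trans_new() above.
cuberoot <- function(x) x^(1/3)
cube     <- function(x) x^3

# The carat breaks used on the x-axis in the plots that follow.
carats <- c(0.2, 0.5, 1, 2, 3)

# Transforming and then inverting should return the original values
# (up to floating-point tolerance).
round_trip_ok <- isTRUE(all.equal(cube(cuberoot(carats)), carats))
round_trip_ok
```

The cube root is a natural choice here on the assumption that carat (a weight) scales with volume, and volume scales with the cube of a linear dimension.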

Use the cuberoot_trans function

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() +
  scale_x_continuous(trans = cuberoot_trans(), limits=c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) +
  scale_y_continuous(trans = log10_trans(), limits=c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle("Price (log10) by Cube-Root of Carat")
## Warning: Removed 1683 rows containing missing values (geom_point).

Overplotting revisited. Plot price vs carat like above, but change the points to add jitter and transparency

# head(sort(table(diamonds$carat), decreasing=T))
# head(sort(table(diamonds$price), decreasing=T))
# 
ggplot(aes(carat, price), data = diamonds) + 
  geom_point(position="jitter", alpha=0.02) + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
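
The commented-out lines in the chunk above hint at why jitter and transparency help: many diamonds share exactly the same carat value, so their points stack on top of each other. A small sketch that runs that check follows; `carat_counts` and `share_tied` are names introduced here.

```r
library(ggplot2)
data(diamonds)

# Most frequent carat values and how many diamonds share each one.
carat_counts <- sort(table(diamonds$carat), decreasing = TRUE)
head(carat_counts)

# Share of rows whose carat value is shared with at least one other diamond.
share_tied <- mean(duplicated(diamonds$carat) |
                   duplicated(diamonds$carat, fromLast = TRUE))
share_tied
```

The same check on `diamonds$price` shows the extent of ties on the y-axis.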

Color the points by the diamond’s clarity

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(position="jitter", alpha=0.2, aes(color=clarity)) + 
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).

Now color the points by the diamond’s cut

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(position="jitter", alpha=0.2, aes(color=cut)) + 
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Cut', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).

Lastly, color the points by the diamond’s color

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(position="jitter", alpha=0.2, aes(color=color)) + 
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Color', reverse = F,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).

Create a linear model for price

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)

mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ======================================================================
##                     m1         m2         m3         m4         m5    
## ----------------------------------------------------------------------
## (Intercept)      2.821***   1.039***   0.874***   0.932***   0.415*** 
##                 (0.006)    (0.019)    (0.019)    (0.017)    (0.010)   
## I(carat^(1/3))   5.558***   8.568***   8.703***   8.438***   9.144*** 
##                 (0.007)    (0.032)    (0.031)    (0.028)    (0.016)   
## carat                      -1.137***  -1.163***  -0.992***  -1.093*** 
##                            (0.012)    (0.011)    (0.010)    (0.006)   
## cut: .L                                0.224***   0.224***   0.120*** 
##                                       (0.004)    (0.004)    (0.002)   
## cut: .Q                               -0.062***  -0.062***  -0.031*** 
##                                       (0.004)    (0.003)    (0.002)   
## cut: .C                                0.051***   0.052***   0.014*** 
##                                       (0.003)    (0.003)    (0.002)   
## cut: ^4                                0.018***   0.018***  -0.002    
##                                       (0.003)    (0.002)    (0.001)   
## color: .L                                        -0.373***  -0.441*** 
##                                                  (0.003)    (0.002)   
## color: .Q                                        -0.129***  -0.093*** 
##                                                  (0.003)    (0.002)   
## color: .C                                         0.001     -0.013*** 
##                                                  (0.003)    (0.002)   
## color: ^4                                         0.029***   0.012*** 
##                                                  (0.003)    (0.002)   
## color: ^5                                        -0.016***  -0.003*   
##                                                  (0.003)    (0.001)   
## color: ^6                                        -0.023***   0.001    
##                                                  (0.002)    (0.001)   
## clarity: .L                                                  0.907*** 
##                                                             (0.003)   
## clarity: .Q                                                 -0.240*** 
##                                                             (0.003)   
## clarity: .C                                                  0.131*** 
##                                                             (0.003)   
## clarity: ^4                                                 -0.063*** 
##                                                             (0.002)   
## clarity: ^5                                                  0.026*** 
##                                                             (0.002)   
## clarity: ^6                                                 -0.002    
##                                                             (0.002)   
## clarity: ^7                                                  0.032*** 
##                                                             (0.001)   
## ----------------------------------------------------------------------
## R-squared            0.924      0.935      0.939     0.951       0.984
## adj. R-squared       0.924      0.935      0.939     0.951       0.984
## sigma                0.280      0.259      0.250     0.224       0.129
## F               652012.063 387489.366 138654.523 87959.467  173791.084
## p                    0.000      0.000      0.000     0.000       0.000
## Log-likelihood   -7962.499  -3631.319  -1837.416  4235.240   34091.272
## Deviance          4242.831   3613.360   3380.837  2699.212     892.214
## AIC              15930.999   7270.637   3690.832 -8442.481  -68140.544
## BIC              15957.685   7306.220   3761.997 -8317.942  -67953.736
## N                53940      53940      53940     53940       53940    
## ======================================================================
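
Because the response is log(price), predictions from these models come out on the log scale and need to be exponentiated to get dollars. The sketch below refits the full model in one call and predicts for a single hypothetical diamond; the diamond's attributes, and the names `new_diamond`, `log_pred`, and `price_pred`, are assumptions made for illustration.

```r
library(ggplot2)
data(diamonds)

# Same specification as m5 above, written out in one call.
m5 <- lm(I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + clarity,
         data = diamonds)

# A hypothetical diamond; factor levels are taken from the training data
# so the ordered-factor contrasts line up.
new_diamond <- data.frame(
  carat   = 1.00,
  cut     = factor("Very Good", levels = levels(diamonds$cut), ordered = TRUE),
  color   = factor("I", levels = levels(diamonds$color), ordered = TRUE),
  clarity = factor("VS1", levels = levels(diamonds$clarity), ordered = TRUE)
)

# Prediction interval on the log scale, then back-transformed to dollars.
log_pred   <- predict(m5, newdata = new_diamond,
                      interval = "prediction", level = 0.95)
price_pred <- exp(log_pred)
price_pred
```

The result has three columns (`fit`, `lwr`, `upr`). Exponentiating the fitted value gives an estimate closer to the median than the mean price, which is a known property of back-transforming a log-linear model.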

Let’s get an updated diamond data set

load("BigDiamonds.Rda")

Your task is to build five linear models like Solomon did for the diamonds data set, only this time you’ll use diamonds from the diamondsbig data set. The code below restricts the data to GIA-certified diamonds priced under $10,000.

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=subset(diamondsbig, (cert == "GIA") & (price < 10000)))
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
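
The models above are nested: each `update()` call keeps the previous formula (the `.`) and appends one term. As a minimal sketch of the same pattern, assuming the built-in `mtcars` data as a stand-in (the variables `mpg`, `wt`, `hp`, and `cyl` are illustrative and have nothing to do with diamonds):

```r
# Sketch only: mtcars stands in for the diamonds data.
base_model <- lm(log(mpg) ~ wt, data = mtcars)
with_hp    <- update(base_model, ~ . + hp)           # keep existing terms, add hp
with_cyl   <- update(with_hp, ~ . + factor(cyl))     # add a categorical term

# Inspect the formula that the chain of updates produced
formula(with_cyl)

# R-squared for each nested model, in the order they were built
r2 <- sapply(list(base_model, with_hp, with_cyl),
             function(m) summary(m)$r.squared)
r2
```

Because each model contains all the terms of the one before it and is fit to the same rows, R-squared cannot decrease from one model to the next.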

mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = subset(diamondsbig, 
##     (cert == "GIA") & (price < 10000)))
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = subset(diamondsbig, 
##     (cert == "GIA") & (price < 10000)))
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = subset(diamondsbig, 
##     (cert == "GIA") & (price < 10000)))
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = subset(diamondsbig, (cert == "GIA") & (price < 10000)))
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = subset(diamondsbig, (cert == "GIA") & (price < 
##     10000)))
## 
## ===========================================================================
##                     m1          m2          m3          m4          m5     
## ---------------------------------------------------------------------------
## (Intercept)       2.671***    1.333***    0.949***    0.529***   -0.464*** 
##                  (0.003)     (0.012)     (0.012)     (0.010)     (0.009)   
## I(carat^(1/3))    5.839***    8.243***    8.633***    8.110***    8.320*** 
##                  (0.004)     (0.022)     (0.021)     (0.017)     (0.012)   
## carat                        -1.061***   -1.223***   -0.782***   -0.763*** 
##                              (0.009)     (0.009)     (0.007)     (0.005)   
## cut: V.Good                               0.120***    0.090***    0.071*** 
##                                          (0.002)     (0.001)     (0.001)   
## cut: Ideal                                0.211***    0.181***    0.131*** 
##                                          (0.002)     (0.001)     (0.001)   
## color: K/L                                            0.123***    0.117*** 
##                                                      (0.004)     (0.003)   
## color: J/L                                            0.312***    0.318*** 
##                                                      (0.003)     (0.002)   
## color: I/L                                            0.451***    0.469*** 
##                                                      (0.003)     (0.002)   
## color: H/L                                            0.569***    0.602*** 
##                                                      (0.003)     (0.002)   
## color: G/L                                            0.633***    0.665*** 
##                                                      (0.003)     (0.002)   
## color: F/L                                            0.687***    0.723*** 
##                                                      (0.003)     (0.002)   
## color: E/L                                            0.729***    0.756*** 
##                                                      (0.003)     (0.002)   
## color: D/L                                            0.812***    0.827*** 
##                                                      (0.003)     (0.002)   
## clarity: I1                                                       0.301*** 
##                                                                  (0.006)   
## clarity: SI2                                                      0.607*** 
##                                                                  (0.006)   
## clarity: SI1                                                      0.727*** 
##                                                                  (0.006)   
## clarity: VS2                                                      0.836*** 
##                                                                  (0.006)   
## clarity: VS1                                                      0.891*** 
##                                                                  (0.006)   
## clarity: VVS2                                                     0.935*** 
##                                                                  (0.006)   
## clarity: VVS1                                                     0.995*** 
##                                                                  (0.006)   
## clarity: IF                                                       1.052*** 
##                                                                  (0.006)   
## ---------------------------------------------------------------------------
## R-squared             0.888       0.892      0.899       0.937        0.969
## adj. R-squared        0.888       0.892      0.899       0.937        0.969
## sigma                 0.289       0.284      0.275       0.216        0.154
## F               2700903.714 1406538.330 754405.425  423311.488   521161.443
## p                     0.000       0.000      0.000       0.000        0.000
## Log-likelihood   -60137.791  -53996.269 -43339.818   37830.414   154124.270
## Deviance          28298.689   27291.534  25628.285   15874.910     7992.720
## AIC              120281.582  108000.539  86691.636  -75632.827  -308204.540
## BIC              120313.783  108043.473  86756.037  -75482.557  -307968.400
## N                338946      338946     338946      338946       338946    
## ===========================================================================
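
Because the response is log(price), each factor coefficient is easier to read after exponentiating: `exp(coef)` is the multiplicative price difference relative to the omitted baseline level, holding the other terms fixed. A short sketch using the two cut coefficients from the m5 column above (copied by hand and rounded to three decimals; the baseline is presumably the cut level that does not appear in the table):

```r
# Cut coefficients copied from the m5 column of the table (rounded values)
cut_coefs <- c(V.Good = 0.071, Ideal = 0.131)

# On the log(price) scale, exp(coef) is the price multiplier vs. the baseline cut
multiplier <- exp(cut_coefs)

# Express the multiplier as a percent difference from the baseline
pct_diff <- round((multiplier - 1) * 100, 1)
pct_diff
```

Read this way, the m5 fit puts a V.Good cut at roughly 7% above the baseline cut and an Ideal cut at roughly 14% above it, for otherwise identical diamonds.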

Let’s use the full model (m5) to predict a diamond’s price. Example diamond from BlueNile: Round, 1.00 carat, Very Good cut, color I, clarity VS1, listed at $5,601.

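# The diamond to price, described with the factor levels the model was fit on
thisDiamond <- data.frame(carat = 1.00, cut = "V.Good",
                          color = "I", clarity = "VS1")
# Predict log(price) with a 95% prediction interval (columns: fit, lwr, upr)
modelEstimate <- predict(m5, newdata = thisDiamond, interval = "prediction",
                         level = 0.95)
# Back-transform the point estimate from log(price) to dollars
actualEstimate <- exp(modelEstimate)[, "fit"]
# Half-width of the back-transformed prediction interval
estimateSig <- (exp(modelEstimate)[, "upr"] - exp(modelEstimate)[, "lwr"]) / 2
# Listed BlueNile price, for comparison
truth <- 5601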
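
As a sanity check on the point estimate, the same prediction can be approximated by hand from the m5 coefficients in the table. This is only a sketch: the coefficients are copied manually and rounded to three decimals, so the result will differ slightly from what `predict()` returns, and it gives no interval.

```r
# m5 coefficients copied from the table (rounded), for the terms this diamond uses
coefs <- c(intercept       = -0.464,
           cube_root_carat =  8.320,
           carat           = -0.763,
           cut_VGood       =  0.071,
           color_I         =  0.469,
           clarity_VS1     =  0.891)

carat <- 1.00
# Design row: intercept, carat^(1/3), carat, then 1 for each matching factor level
x <- c(1, carat^(1/3), carat, 1, 1, 1)

log_price <- sum(coefs * x)   # linear predictor on the log(price) scale
price_hat <- exp(log_price)   # back-transform to dollars
truth     <- 5601             # listed BlueNile price

c(log_price = log_price, price_hat = price_hat, gap = truth - price_hat)
```

With these rounded coefficients the estimate comes out a little above $5,000, so the listed price of $5,601 sits several hundred dollars higher than the model's point estimate.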